Intro to research toolbox

Jeremy Lew

August 1, 2024

Topics

  1. Introduction & learning plan
  2. Getting comfortable with R
  3. Programming fundamentals
  4. Data cleaning
  5. Data visualisation
  6. Statistical analysis
  7. Presentation of results

Introduction & learning plan

Motivation

  1. Added tools to your toolbox
  2. Transferable skill
  3. Learning how to program in 1 language helps you learn other languages
  4. Leverage open-source tools/projects written by others

Learning plan

  1. The topics in this deck mirror the workflow of a project

flowchart LR
  D[Data exported\nfrom REDCap]-->C[Data cleaning]
  C-->S[Statistical analysis]
  S-->P[Presentation\nof results]

  1. The game plan is to first do a quick pass through from data cleaning to analysis:
    1. Take a toy dataset
    2. “Clean” the data (minimally)
    3. Run a regression model
  2. Once we have gotten comfortable with R, we will explore each topic in greater detail
    1. Data cleaning
    2. Programming fundamentals
    3. Analysis: generalised linear models
    4. Presentation of results

Resources

  1. R courses
  2. The Epidemiologist R Handbook
  3. R for Data Science
  4. Advanced R
  5. Cheat Sheets

Getting comfortable with R

What is an R package?

  1. It is a collection of code/data written by someone and packaged for distribution
  2. Hosted on the Comprehensive R Archive Network (CRAN) and/or GitHub for public download
  3. From the CRAN webpage, we are able to find a reference manual documenting the data/functions of the package
  4. From the GitHub repository, we are able to see the actual code implemented by the package’s author(s)
  5. To install a package, we would run install.packages("palmerpenguins") in the console
  6. To load a package, we would run library(palmerpenguins)
  7. To uninstall a package, we would run remove.packages("palmerpenguins") in the console

[DEMO] Intro to RStudio workspace

  1. Console: where you run/execute lines of code

  2. To assign a variable, we use the <- operator

    x <- 1:10; print(x)
     [1]  1  2  3  4  5  6  7  8  9 10
    1. memory (RAM) is allocated when a variable is assigned/declared
    2. memory is freed when the variable is deleted (or garbage collected if it goes out of scope)
  3. Environment: where you’re able to see the variables stored in memory

    1. To clear variables from environment \(\rightarrow\) execute rm(list = ls()) in console

[DEMO] Intro to RStudio workspace

  1. Script
    1. Writing code and saving the script
    2. Execute single/multiple lines of code \(\rightarrow\) highlight code + Ctrl-enter
    3. Runs from top to bottom
    4. Comments will not be executed
  2. Hotkeys
    1. Clear the console \(\rightarrow\) Ctrl-l
    2. Clear a line in console \(\rightarrow\) Ctrl-d

[PRACTISE] Tasks to do

  1. For this task, you will need the palmerpenguins and tidyverse packages. Install them if you have not already done so
  2. We will make use of the penguins dataset from palmerpenguins package
  3. Understand the data
  4. Dichotomise body_mass_g (continuous) into a categorical variable with 2 categories (“light”: <= 4750g and “heavy”: > 4750g)
  5. Run a linear regression with body_mass_g as the dependent variable and independent variables: species, bill_length_mm, bill_depth_mm, flipper_length_mm and sex
  6. Run a logistic regression with the categorical body mass variable you created earlier as the dependent variable, with the same independent variables as before
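One possible solution, sketched minimally (variable names and the 4750 g cutoff are as stated in the tasks; dat, fit_lm and fit_glm are hypothetical names):

```r
library(tidyverse)
library(palmerpenguins)

# Dichotomise body_mass_g at 4750 g
dat <- penguins %>%
  mutate(body_mass_cat = if_else(body_mass_g <= 4750, "light", "heavy") %>%
           factor(levels = c("light", "heavy")))

# Linear regression with the continuous outcome
fit_lm <- lm(body_mass_g ~ species + bill_length_mm + bill_depth_mm +
               flipper_length_mm + sex, data = dat)

# Logistic regression with the dichotomised outcome
fit_glm <- glm(body_mass_cat ~ species + bill_length_mm + bill_depth_mm +
                 flipper_length_mm + sex,
               data = dat, family = binomial)

summary(fit_lm)
summary(fit_glm)
```

Note that glm treats the first factor level ("light") as failure and the second ("heavy") as success.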

Programming Fundamentals

Data types

  1. Character data type e.g. "a"

    typeof("a")
    [1] "character"
  2. Integer data type e.g. 1L

    typeof(1L)
    [1] "integer"
  3. Numeric data type e.g. 5.0

    typeof(5.0)
    [1] "double"
  4. Logical data type e.g. TRUE or FALSE

    typeof(TRUE)
    [1] "logical"
  5. Complex data type e.g. 1 + 4i

    typeof(1 + 4i)
    [1] "complex"

Data types (continued)

  1. Factor datatype for working with categorical variables

    x <- factor(sample(c("Malay", "Others", "Indian", "Chinese"), 10, replace = T),
                levels = c("Chinese", "Malay", "Indian", "Others"))
    print(x)
     [1] Malay  Others Others Indian Malay  Malay  Others Others Malay  Indian
    Levels: Chinese Malay Indian Others
    print(levels(x))
    [1] "Chinese" "Malay"   "Indian"  "Others" 
  2. A factor variable is more useful than a plain character vector. E.g. the first level will be used by glm as the reference category
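The reference-category behaviour can be seen on simulated data (the binary outcome y here is hypothetical, purely for illustration):

```r
set.seed(1)
# Same ethnicity factor as above, with "Chinese" as the first level
eth <- factor(sample(c("Malay", "Others", "Indian", "Chinese"), 100, replace = TRUE),
              levels = c("Chinese", "Malay", "Indian", "Others"))
y <- rbinom(100, 1, 0.5)  # fake binary outcome

fit <- glm(y ~ eth, family = binomial)
# Coefficients are contrasts against the first level, "Chinese"
names(coef(fit))
```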

Data structures (vectors)

  1. Data structures are containers for holding data

  2. A character vector holds multiple characters

    letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
    [20] "t" "u" "v" "w" "x" "y" "z"
    typeof(letters)
    [1] "character"
  3. A numeric vector holds multiple numbers

    x <- 1:5; print(x)
    [1] 1 2 3 4 5
    typeof(x)
    [1] "integer"

Exercise

  1. Can you query the length of your character vector? length(letters)
  2. Can you slice the character vector to get the first 5 elements? letters[1:5]
  3. Can you slice the character vector to get the 1st, 9th, 20th elements? letters[c(1, 9, 20)]
  4. Can you flip the character vector from last to first element? letters[length(letters):1]

Data structures (lists)

  1. Vectors can only hold one data type, e.g. character, numeric, etc.
  2. Lists are “containers” that can hold multiple data types
    1. Unnamed lists

      list("a", 1, 10:15)
      [[1]]
      [1] "a"
      
      [[2]]
      [1] 1
      
      [[3]]
      [1] 10 11 12 13 14 15
    2. Named lists

      list(a = "a", b = 1, c = 10:15)
      $a
      [1] "a"
      
      $b
      [1] 1
      
      $c
      [1] 10 11 12 13 14 15

Data structures (lists, continued)

Exercise

  1. Create a variable in your console: mylist <- list(a = "a", b = 1, c = 10:15)
  2. Can you query the length of mylist? length(mylist)
  3. Can you get the second element of mylist using
    1. index: mylist[[2]]
    2. key: mylist$b or mylist[["b"]]
  4. Compare the difference between mylist[3] and mylist[[3]]
  5. Compare the difference between class(mylist[3]) and class(mylist[[3]])
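The difference in the last two points can be checked directly: single brackets return a sub-list, double brackets extract the element itself.

```r
mylist <- list(a = "a", b = 1, c = 10:15)

class(mylist[3])    # "list": a sub-list containing one element
class(mylist[[3]])  # "integer": the underlying vector itself

length(mylist[3])   # 1 (one element in the sub-list)
length(mylist[[3]]) # 6 (six integers in the vector)
```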

Data structures (dataframe)

  1. A dataframe is a tabular container that, like a list, can hold multiple data types; however, each column can only store data of a single type

  2. To view the first 2 rows of the penguins dataset from the palmerpenguins package

    library(tidyverse)
    library(palmerpenguins)
    penguins %>% head(2)
    # A tibble: 2 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    # ℹ 2 more variables: sex <fct>, year <int>
  3. To view the last 2 rows of the dataset

    penguins %>% tail(2)
    # A tibble: 2 × 8
      species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
    1 Chinstrap Dream            50.8          19                 210        4100
    2 Chinstrap Dream            50.2          18.7               198        3775
    # ℹ 2 more variables: sex <fct>, year <int>
  4. To view the data in RStudio, execute View(penguins)

Data structures (dataframe, continued)

Exercise

  1. Can you query the dimensions of the penguins dataset using dim(penguins), ncol(penguins), nrow(penguins)?
  2. Can you get a glimpse of the data using the function glimpse(penguins)?
  3. Can you view summary statistics of the data using summary(penguins)?

Data structures (dataframe, accessing variables)

  1. To get the column/variable-names of your dataset

    colnames(penguins)
    [1] "species"           "island"            "bill_length_mm"   
    [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
    [7] "sex"               "year"             
  2. To access any column variable, we use the $ syntax

    penguins$species[1:10]
     [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
    Levels: Adelie Chinstrap Gentoo
  3. To get a table count of a variable or a cross-table count of 2 variables

    table(penguins$sex, useNA = "ifany")
    
    female   male   <NA> 
       165    168     11 
  4. See Data Cleaning for more ways to work with dataframes

Data structures (dataframe, loading data)

  1. To import data from a csv file, use readr’s read_csv
  2. To import data from an excel file, use readxl’s read_excel
  3. To import data from SAS/SPSS/Stata, or to export from R to these file formats, use the haven package
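A minimal sketch (it writes a small csv first so the example is self-contained; the file names are hypothetical, and the excel/haven calls follow the same pattern):

```r
library(readr)

# Create a small csv so the example is self-contained
write_csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), "demo.csv")

df <- read_csv("demo.csv")
print(df)

# Analogous calls (not run here):
# readxl::read_excel("demo.xlsx")
# haven::read_dta("demo.dta")     # Stata; read_sas() / read_sav() for SAS / SPSS
# haven::write_dta(df, "demo.dta")
```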

Data structures (dataframe, adding labels)

  1. To add labels to a dataframe, we make use of the labelled package

Functions

  1. A good chapter explaining functions is Chapter 6 of Advanced R, from which most of the examples in these slides are taken

  2. Example function

    f01 <- function(x, y) {
      # A comment
      x + y
    }
  3. Components of a function

    1. Arguments \(\rightarrow\) inputs to the function (there can also be functions with no arguments)
    2. Body \(\rightarrow\) the code inside the function
    3. Environment \(\rightarrow\) a list-like structure that stores/maps the variable name to value (a namespace as some people call it)
  4. Functions are objects that can be assigned or they can be anonymous functions
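The last point in brief: a function can be bound to a name or passed anonymously.

```r
# Bound to a name
square <- function(x) x^2

# Anonymous, passed directly to another function
sapply(1:3, function(x) x^2)  # 1 4 9
# R >= 4.1 backslash shorthand for the same thing
sapply(1:3, \(x) x^2)
```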

Function invocation

  1. How do we invoke the function?

    print(f01)
    function(x, y) {
      # A comment
      x + y
    }
    f01(3, 5)
    [1] 8
  2. How do we invoke one function after another?

    1. Suppose we have another function f02

      f02 <- function(a, b) {
        a * 10 + b
      }
    2. To call f01 first followed by f02 and sqrt we will “nest” the functions (inside-out, right to left)

      sqrt(f02(f01(1, 2), 5))
      [1] 5.91608

Function invocation (continued)

Another way to invoke a series of functions is to use the pipe operator %>% provided by the magrittr package

library(magrittr)
value <- 1
value %>%
  f01(2) %>%
  f02(5) %>%
  sqrt()
[1] 5.91608

Exercise

  1. What is the difference between f02 and f02(1, 2)?

Function output

  1. What is returned by a function? The last evaluated expression

  2. How do we store values returned by a function? With an assignment statement

  3. Early termination of a function with a return statement
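The three points sketched together (classify is a hypothetical example function):

```r
classify <- function(x) {
  if (is.na(x)) return(NA_character_)        # 3. early termination with return()
  if (x > 0) "positive" else "non-positive"  # 1. last evaluated expression is returned
}

res <- classify(3)  # 2. store the output with an assignment
res                 # "positive"
```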

Functions - argument passing

  1. Pass by copy vs pass by reference

  2. In R, arguments effectively behave as pass by reference as long as the object is not modified within the function; modifying it triggers a copy, i.e. copy-on-modify

  3. How to pass arguments?

    f04 <- function(x, y, z) {
      (x / y) * z
    }
    1. By position e.g. f04(10, 50, 20)

    2. By name e.g. f04(z = 20, y = 50, x = 10)

    3. Unpacking multiple named arguments using do.call

      do.call(f04, list(x = 10, y = 50, z = 20))
      [1] 4
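The copy-on-modify behaviour described above can be observed directly: modifying an argument inside a function triggers a copy, so the caller’s object is untouched.

```r
modify_first <- function(v) {
  v[1] <- 0  # this modification triggers a copy of v
  v
}

x <- c(10, 20, 30)
y <- modify_first(x)

x[1]  # still 10: the caller's vector is unchanged
y[1]  # 0: the function worked on its own copy
```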

Functions - lexical scoping, name masking

Names defined inside a function mask names defined outside a function

x <- 10
y <- 20
f05 <- function() {
  x <- 1
  y <- 2
  c(x, y)
}

f05()
[1] 1 2

Functions - lexical scoping, looking one level up

R first looks up a variable in the current scope (e.g. inside the function); if it is not found there, R proceeds upwards (to the enclosing function, then the global environment) until the variable is found

x <- 1
f06 <- function() {
  y <- 2
  i <- function() {
    z <- 3
    c(x, y, z)
  }
  i()
}

f06()
[1] 1 2 3

Exercise

Using the example function below, change the z variable inside the function and observe that…

z <- 10
f07 <- function(x, y) {
  # make your edit here
  z <- z * 100
  z + x + y
}

…there is no change to the z variable outside the function in the global environment
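Running the exercise confirms the claim: assignment inside the function creates a local binding, so the global z is read but never written.

```r
z <- 10
f07 <- function(x, y) {
  z <- z * 100  # reads the global z (10), then binds a *local* z (1000)
  z + x + y
}

f07(1, 2)  # 1003
z          # still 10: the global z is unchanged
```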

Functions - lexical scoping, dynamic lookup

R looks for the values when the function is run, not when the function is created

z <- 10
f08 <- function(x, y) {
  z + x + y
}

z <- 100
f08(5, 10)
[1] 115

Functions

Loops
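This slide is a placeholder in the deck; as a minimal sketch, the two common patterns are an explicit for loop and the usually preferred vectorised form:

```r
# Explicit loop
total <- 0
for (i in 1:5) {
  total <- total + i
}
total    # 15

# Vectorised equivalent, usually preferred in R
sum(1:5) # 15
```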

Debugging

Memory Management

Data Cleaning

Common functions - packages

  1. These functions are from various packages, conveniently collected in the tidyverse package
  2. The workhorse package for data wrangling is dplyr. Useful info can be found in this chapter
  3. A package for working with dates is lubridate
  4. A package for working with strings is stringr. This is useful for cleaning free-text response data
  5. A package for working with factor datatype is forcats

Common functions - dplyr::select

The select function enables you to select a subset of the columns in your dataset


# library(tidyverse) # if not already loaded
# library(palmerpenguins) # if not already loaded
penguins %>%
  select(species, island, bill_length_mm) %>%
  head(3)
# A tibble: 3 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3


# there is also ends_with
penguins %>%
  select(starts_with("bill")) %>%
  head(3)
# A tibble: 3 × 2
  bill_length_mm bill_depth_mm
           <dbl>         <dbl>
1           39.1          18.7
2           39.5          17.4
3           40.3          18  


penguins %>%
  select(matches("mm")) %>%
  head(3)
# A tibble: 3 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <int>
1           39.1          18.7               181
2           39.5          17.4               186
3           40.3          18                 195

Common functions - dplyr::filter

The filter function enables you to select a subset of the rows that meet a certain criteria

penguins %>%
  filter(body_mass_g > 3500) %>%
  head(3)
# A tibble: 3 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
penguins %>% filter(sex == "female")
# A tibble: 165 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Common functions - dplyr::mutate

The mutate function enables you to add columns to your dataset. The added columns can be derived from existing column(s)

penguins %>%
  mutate(body_mass_100g = body_mass_g / 100) %>%
  select(body_mass_g, body_mass_100g) %>%
  head(5)
# A tibble: 5 × 2
  body_mass_g body_mass_100g
        <int>          <dbl>
1        3750           37.5
2        3800           38  
3        3250           32.5
4          NA           NA  
5        3450           34.5
penguins %>%
  mutate(bill_length_plus_depth_mm = bill_length_mm + bill_depth_mm) %>%
  select(matches("bill")) %>%
  head(5)
# A tibble: 5 × 3
  bill_length_mm bill_depth_mm bill_length_plus_depth_mm
           <dbl>         <dbl>                     <dbl>
1           39.1          18.7                      57.8
2           39.5          17.4                      56.9
3           40.3          18                        58.3
4           NA            NA                        NA  
5           36.7          19.3                      56  

Common functions - dplyr::group_by, summarise

  1. The summarise function enables you to get summary statistics like N, mean, median etc

    penguins %>% summarise(N = n(),
                           mean_body_mass_g = mean(body_mass_g))
    # A tibble: 1 × 2
          N mean_body_mass_g
      <int>            <dbl>
    1   344               NA
  2. The group_by function enables you to get summary statistics within groups

    penguins %>%
      group_by(sex) %>%
      summarise(N = n(),
                mean_body_mass_g = mean(body_mass_g),
                median_body_mass_g = median(body_mass_g))
    # A tibble: 3 × 4
      sex        N mean_body_mass_g median_body_mass_g
      <fct>  <int>            <dbl>              <dbl>
    1 female   165            3862.               3650
    2 male     168            4546.               4300
    3 <NA>      11              NA                  NA

Common functions - dplyr::arrange

The function arrange enables us to sort by a certain variable

penguins %>%
  group_by(sex, year) %>%
  summarise(mean_body_mass_g = mean(body_mass_g)) %>%
  arrange(desc(year))
# A tibble: 9 × 3
# Groups:   sex [3]
  sex     year mean_body_mass_g
  <fct>  <int>            <dbl>
1 female  2009            3874.
2 male    2009            4521.
3 <NA>    2009              NA 
4 female  2008            3888.
5 male    2008            4632.
6 <NA>    2008            4650 
7 female  2007            3821.
8 male    2007            4479.
9 <NA>    2007              NA 

Common functions - dplyr::if_else, case_when

The functions if_else and case_when are often used with mutate to create new variables

set.seed(2)
penguins %>%
  mutate(bill_length_type = if_else(bill_length_mm >= 48.5, "long bill", "short bill")) %>%
  sample_n(4)
# A tibble: 4 × 9
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinstrap Dream            43.5          18.1               202        3400
2 Gentoo    Biscoe           43.6          13.9               217        4900
3 Gentoo    Biscoe           48.1          15.1               209        5500
4 Gentoo    Biscoe           46.8          14.3               215        4850
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_type <chr>
penguins %>%
  mutate(bill_length_type = case_when(bill_length_mm >= 48.5 ~ "long bill",
                                      bill_length_mm < 48.5 ~ "short bill",
                                      TRUE ~ NA_character_)) %>%
  select(bill_length_type) %>%
  table(useNA = "ifany")
bill_length_type
 long bill short bill       <NA> 
        87        255          2 

Common functions - forcats::fct_relevel

  1. If we convert a character vector to a factor vector, the levels will be arranged in alphabetical order automatically if unspecified

    penguins %>%
      mutate(bill_length_type = case_when(
        bill_length_mm < 40 ~ "short bill",
        between(bill_length_mm, 40, 50) ~ "medium bill",
        bill_length_mm > 50 ~ "long bill",
        TRUE ~ NA_character_) %>% factor()
      ) %>%
      .$bill_length_type %>% levels()
    [1] "long bill"   "medium bill" "short bill" 
  2. We can specify the levels of the factor using the levels argument

    penguins %>%
      mutate(bill_length_type = case_when(
        bill_length_mm < 40 ~ "short bill",
        between(bill_length_mm, 40, 50) ~ "medium bill",
        bill_length_mm > 50 ~ "long bill",
        TRUE ~ NA_character_) %>%
          factor(levels = c("short bill", "medium bill", "long bill"))
      ) %>%
      .$bill_length_type %>% levels()
    [1] "short bill"  "medium bill" "long bill"  
  3. The factor vector will only contain values specified by levels. Any existing character values unspecified in levels (e.g. “long bill”) will be converted to NA

    penguins %>%
      mutate(bill_length_type = case_when(
        bill_length_mm < 40 ~ "short bill",
        between(bill_length_mm, 40, 50) ~ "medium bill",
        bill_length_mm > 50 ~ "long bill",
        TRUE ~ NA_character_) %>%
          factor(levels = c("short bill", "medium bill"))
      ) %>%
      .$bill_length_type %>% levels()
    [1] "short bill"  "medium bill"
  4. fct_relevel reorders the levels of a factor vector (a character vector will first be converted to a factor). Any unspecified levels follow after the specified ones, in their existing order

    penguins %>%
      mutate(bill_length_type = case_when(
        bill_length_mm < 40 ~ "short bill",
        between(bill_length_mm, 40, 50) ~ "medium bill",
        bill_length_mm > 50 ~ "long bill",
        TRUE ~ NA_character_) %>%
          fct_relevel("short bill", "medium bill")
      ) %>%
      .$bill_length_type %>% levels()
    [1] "short bill"  "medium bill" "long bill"  

Common functions - forcats::fct_collapse

  1. We can collapse levels of a factor variable using forcats’ fct_collapse function

  2. The island variable has 3 levels

    penguins$island %>% table()
    .
       Biscoe     Dream Torgersen 
          168       124        52 
  3. Suppose we want to collapse island into 2 levels, we can apply fct_collapse like so

    penguins %>%
      mutate(island = fct_collapse(island, "Dream/Torgersen" = c("Dream", "Torgersen"))) %>%
      .$island %>%
      table()
    .
             Biscoe Dream/Torgersen 
                168             176 

Common functions - tidyr::pivot_wider

  1. Suppose we have some data in long format

    # Generate fake patient data for illustration
    generate_fake_patient_data <- function(id) {
      data.frame(list(
        patient_id = id,
        time = c("baseline", "12mth"),
        hba1c = runif(n = 2, min = 4, max = 14),
        ldl = sample(c("good control", "bad control"))
      ))
    }
    
    set.seed(2024)
    data <- map(
      sprintf("P%s", str_pad(1:344, width = 2, pad = 0)),
      generate_fake_patient_data
    ) %>%
      bind_rows()
    print(head(data, 10))
       patient_id     time     hba1c          ldl
    1         P01 baseline 12.369425 good control
    2         P01    12mth  7.208675  bad control
    3         P02 baseline  8.570092 good control
    4         P02    12mth 11.014203  bad control
    5         P03 baseline 12.765881  bad control
    6         P03    12mth  5.190482 good control
    7         P04 baseline 11.035531  bad control
    8         P04    12mth  9.157855 good control
    9         P05 baseline  5.103111  bad control
    10        P05    12mth 12.632152 good control
  2. We can make use of tidyr’s pivot_wider to reshape our data from long to wide

    data_wide <- data %>%
      pivot_wider(id_cols = patient_id,
                  names_from = "time",
                  names_glue = "{.value}_{time}",
                  values_from = c(hba1c, ldl)) 
    head(data_wide, 15)
    # A tibble: 15 × 5
       patient_id hba1c_baseline hba1c_12mth ldl_baseline ldl_12mth   
       <chr>               <dbl>       <dbl> <chr>        <chr>       
     1 P01                 12.4         7.21 good control bad control 
     2 P02                  8.57       11.0  good control bad control 
     3 P03                 12.8         5.19 bad control  good control
     4 P04                 11.0         9.16 bad control  good control
     5 P05                  5.10       12.6  bad control  good control
     6 P06                  4.47        5.12 bad control  good control
     7 P07                 12.0         6.92 good control bad control 
     8 P08                  4.01        4.21 bad control  good control
     9 P09                  5.76       13.7  bad control  good control
    10 P10                  4.54       13.7  good control bad control 
    11 P11                 12.7        13.1  good control bad control 
    12 P12                 12.5         8.48 bad control  good control
    13 P13                 10.5        10.5  bad control  good control
    14 P14                  8.53        8.89 bad control  good control
    15 P15                  4.20        5.92 good control bad control 

Common functions - tidyr::pivot_longer

  1. Suppose instead that we were given data in a wide format

    print(head(data_wide, 10))
    # A tibble: 10 × 5
       patient_id hba1c_baseline hba1c_12mth ldl_baseline ldl_12mth   
       <chr>               <dbl>       <dbl> <chr>        <chr>       
     1 P01                 12.4         7.21 good control bad control 
     2 P02                  8.57       11.0  good control bad control 
     3 P03                 12.8         5.19 bad control  good control
     4 P04                 11.0         9.16 bad control  good control
     5 P05                  5.10       12.6  bad control  good control
     6 P06                  4.47        5.12 bad control  good control
     7 P07                 12.0         6.92 good control bad control 
     8 P08                  4.01        4.21 bad control  good control
     9 P09                  5.76       13.7  bad control  good control
    10 P10                  4.54       13.7  good control bad control 
  2. We can make use of tidyr’s pivot_longer to reshape our data into long format

    data_wide %>%
      pivot_longer(
        cols = matches("baseline|12mth"),
        names_pattern = "(.*)_(baseline|12mth)",
        names_to = c(".value", "time")
      )
    # A tibble: 688 × 4
       patient_id time     hba1c ldl         
       <chr>      <chr>    <dbl> <chr>       
     1 P01        baseline 12.4  good control
     2 P01        12mth     7.21 bad control 
     3 P02        baseline  8.57 good control
     4 P02        12mth    11.0  bad control 
     5 P03        baseline 12.8  bad control 
     6 P03        12mth     5.19 good control
     7 P04        baseline 11.0  bad control 
     8 P04        12mth     9.16 good control
     9 P05        baseline  5.10 bad control 
    10 P05        12mth    12.6  good control
    # ℹ 678 more rows

Common functions - dplyr::bind_rows, bind_cols
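This slide has no body in the deck; as a minimal sketch, bind_rows stacks dataframes row-wise and bind_cols joins them column-wise (the tibbles a and b are hypothetical):

```r
library(dplyr)

a <- tibble(id = 1:2, value = c("a", "b"))
b <- tibble(id = 3:4, value = c("c", "d"))

bind_rows(a, b)  # 4 rows, 2 columns: rows of b appended below rows of a
bind_cols(a, b, .name_repair = "unique")  # 2 rows, 4 columns; duplicate names repaired
```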

Common functions - dplyr::mutate + across

  1. This combination of mutate & across enables us to apply a function to multiple columns of a dataframe

    penguins %>%
      mutate(across(.cols = c(bill_length_mm, bill_depth_mm), 
                    .fns = ~ floor(.x),
                    .names = "{.col}_floored")) %>%
      select(matches("bill_(length|depth)"))
    # A tibble: 344 × 4
       bill_length_mm bill_depth_mm bill_length_mm_floored bill_depth_mm_floored
                <dbl>         <dbl>                  <dbl>                 <dbl>
     1           39.1          18.7                     39                    18
     2           39.5          17.4                     39                    17
     3           40.3          18                       40                    18
     4           NA            NA                       NA                    NA
     5           36.7          19.3                     36                    19
     6           39.3          20.6                     39                    20
     7           38.9          17.8                     38                    17
     8           39.2          19.6                     39                    19
     9           34.1          18.1                     34                    18
    10           42            20.2                     42                    20
    # ℹ 334 more rows
  2. If we omit the .names argument, the transformed values overwrite the original columns (i.e. applied in place)

    penguins %>%
      mutate(across(.cols = c(bill_length_mm, bill_depth_mm), 
                    .fns = ~ floor(.x))) %>%
      select(matches("bill_(length|depth)"))
    # A tibble: 344 × 2
       bill_length_mm bill_depth_mm
                <dbl>         <dbl>
     1             39            18
     2             39            17
     3             40            18
     4             NA            NA
     5             36            19
     6             39            20
     7             38            17
     8             39            19
     9             34            18
    10             42            20
    # ℹ 334 more rows

Common functions - stringr::str_detect

  1. We can make use of regular expressions in stringr’s str_detect function together with filter, or if_else/case_when for data cleaning

  2. Here we create a species_plus variable

    set.seed(2024)
    penguins <- penguins %>%
      mutate(across(
        .cols = "species",
        .fns = ~ paste(.x,
          sample(c("baby", "teenager", "adult"), nrow(penguins), replace = TRUE),
          sex,
          sep = "_"
        ),
        .names = "{.col}_plus"
      ))
    # view counts of species_plus
    table(penguins$species_plus)
    
          Adelie_adult_female         Adelie_adult_male           Adelie_adult_NA 
                           22                        17                         3 
           Adelie_baby_female          Adelie_baby_male            Adelie_baby_NA 
                           27                        33                         1 
       Adelie_teenager_female      Adelie_teenager_male        Adelie_teenager_NA 
                           24                        23                         2 
       Chinstrap_adult_female      Chinstrap_adult_male     Chinstrap_baby_female 
                           14                        11                         7 
          Chinstrap_baby_male Chinstrap_teenager_female   Chinstrap_teenager_male 
                           11                        13                        12 
          Gentoo_adult_female         Gentoo_adult_male           Gentoo_adult_NA 
                           25                        20                         2 
           Gentoo_baby_female          Gentoo_baby_male            Gentoo_baby_NA 
                           21                        22                         1 
       Gentoo_teenager_female      Gentoo_teenager_male        Gentoo_teenager_NA 
                           12                        19                         2 
  3. We can use a regular expression to identify categories

    penguins %>%
      mutate(across(
        .cols = "species_plus",
        .fns = ~ case_when(
          # Chinstrap/Gentoo, not followed by baby, male/female
          str_detect(.x, "((Chinstrap|Gentoo)(?!_baby)).*(male|female)") ~ "DO SOMETHING",
          TRUE ~ .x # maintain existing value
        ),
        .names = "{.col}2" 
      )) %>%
      .$species_plus2 %>%
      table()
    .
       Adelie_adult_female      Adelie_adult_male        Adelie_adult_NA 
                        22                     17                      3 
        Adelie_baby_female       Adelie_baby_male         Adelie_baby_NA 
                        27                     33                      1 
    Adelie_teenager_female   Adelie_teenager_male     Adelie_teenager_NA 
                        24                     23                      2 
     Chinstrap_baby_female    Chinstrap_baby_male           DO SOMETHING 
                         7                     11                    126 
           Gentoo_adult_NA     Gentoo_baby_female       Gentoo_baby_male 
                         2                     21                     22 
            Gentoo_baby_NA     Gentoo_teenager_NA 
                         1                      2 

Data visualisation

The ggplot2 package is well known for plotting figures

Histograms

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = species)) + geom_histogram(bins = 40)

Facet wrap

penguins %>%
  ggplot(aes(x = flipper_length_mm, fill = sex)) +
    geom_histogram() +
    facet_wrap(~species, dir = "v", scales = "free_y")

Analysis: Descriptive statistics

Analysis: Bivariate tests for independent observations

Variable 1    Variable 2    Bivariate test
Categorical   Categorical   1. Chi-square test
                            2. Fisher’s exact test
Categorical   Continuous    Parametric:
                            1. Independent t-test (2 categories)
                            2. One-way independent ANOVA (>2 categories)
Categorical   Continuous    Non-parametric:
                            1. Mann-Whitney test (2 categories), a.k.a. Wilcoxon rank-sum test
                            2. Kruskal-Wallis test (>2 categories)
Continuous    Continuous    Parametric: Pearson’s correlation coefficient
Continuous    Continuous    Non-parametric: Spearman’s correlation coefficient
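As a sketch, these tests map onto base R functions; the penguins variable pairings below are illustrative choices, not prescriptions:

```r
library(palmerpenguins)

# Categorical vs categorical
chisq.test(table(penguins$species, penguins$island))
fisher.test(table(penguins$sex, penguins$island))

# Categorical vs continuous, parametric
t.test(body_mass_g ~ sex, data = penguins)            # 2 categories
summary(aov(body_mass_g ~ species, data = penguins))  # >2 categories

# Categorical vs continuous, non-parametric
wilcox.test(body_mass_g ~ sex, data = penguins)       # Mann-Whitney / rank-sum
kruskal.test(body_mass_g ~ species, data = penguins)

# Continuous vs continuous
cor.test(~ bill_length_mm + body_mass_g, data = penguins, method = "pearson")
cor.test(~ bill_length_mm + body_mass_g, data = penguins, method = "spearman")
```

Each function returns an htest object, so the p-value can be extracted with `$p.value`.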

Analysis: Bivariate tests for repeated measurements

Variable 1    Variable 2                   Bivariate test
Categorical   Categorical time             McNemar’s test
              (e.g. baseline, 12-month)
Continuous    Categorical time             Parametric: Dependent t-test
              (e.g. baseline, 12-month)
Continuous    Categorical time             Non-parametric: Wilcoxon signed-rank test
              (e.g. baseline, 12-month)
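A minimal sketch of the base R equivalents for these paired tests, on made-up pre/post values:

```r
# Hypothetical paired measurements on the same patients
pre  <- c(7.1, 8.4, 6.9, 9.2, 7.8, 8.8)
post <- c(6.8, 8.0, 7.1, 8.5, 7.2, 8.1)

t.test(pre, post, paired = TRUE)       # dependent t-test
wilcox.test(pre, post, paired = TRUE)  # Wilcoxon signed-rank test

# McNemar's test takes a 2x2 table of paired categorical outcomes
tab <- matrix(c(20, 5, 10, 15), nrow = 2,
              dimnames = list(baseline = c("good", "bad"),
                              followup = c("good", "bad")))
mcnemar.test(tab)
```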

Measures of central tendency

  1. Mean (SD) and median (IQR) help you get a sense of the distribution of characteristics in your study population with respect to levels of the outcome
  2. The tableby function of the arsenal package lets you do this easily
  3. We will talk about improving upon the formatting of the table in presenting tables
library(arsenal)

# tableby invoked by the next 3 lines
tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
  summary(text = TRUE) %>%
  knitr::kable(align = c("l", rep("r", 5)))
Adelie (N=152) Chinstrap (N=68) Gentoo (N=124) Total (N=344) p value
sex 0.976
- N-Miss 6 0 5 11
- female 73 (50.0%) 34 (50.0%) 58 (48.7%) 165 (49.5%)
- male 73 (50.0%) 34 (50.0%) 61 (51.3%) 168 (50.5%)
island < 0.001
- Biscoe 44 (28.9%) 0 (0.0%) 124 (100.0%) 168 (48.8%)
- Dream 56 (36.8%) 68 (100.0%) 0 (0.0%) 124 (36.0%)
- Torgersen 52 (34.2%) 0 (0.0%) 0 (0.0%) 52 (15.1%)
bill_length_mm < 0.001
- N-Miss 1 0 1 2
- Mean (SD) 38.791 (2.663) 48.834 (3.339) 47.505 (3.082) 43.922 (5.460)
- Median (Q1, Q3) 38.800 (36.750, 40.750) 49.550 (46.350, 51.075) 47.300 (45.300, 49.550) 44.450 (39.225, 48.500)
- Range 32.100 - 46.000 40.900 - 58.000 40.900 - 59.600 32.100 - 59.600

Measures of central tendency (repeated measures)

  1. Suppose we have repeated measures of HbA1c and LDL from patients, e.g. at baseline and at 12 months

    generate_fake_patient_data <- function(id) {
      data.frame(list(
        patient_id = id,
        time = c("baseline", "12-month"),
        hba1c = runif(n = 2, min = 4, max = 14),
        ldl = sample(c("good control", "bad control"))
      ))
    }
    
    set.seed(2024)
    data <- map(
      sprintf("P%s", str_pad(1:30, width = 2, pad = "0")),
      generate_fake_patient_data
    ) %>%
      bind_rows() %>%
      mutate(across(.cols = time,
                    .fns = ~ fct_relevel(.x, "baseline", "12-month")))
    print(head(data, 10))
       patient_id     time     hba1c          ldl
    1         P01 baseline 12.369425 good control
    2         P01 12-month  7.208675  bad control
    3         P02 baseline  8.570092 good control
    4         P02 12-month 11.014203  bad control
    5         P03 baseline 12.765881  bad control
    6         P03 12-month  5.190482 good control
    7         P04 baseline 11.035531  bad control
    8         P04 12-month  9.157855 good control
    9         P05 baseline  5.103111  bad control
    10        P05 12-month 12.632152 good control
  2. The paired function from the arsenal package helps us tabulate pre-post tests easily

    # pacman::p_load(arsenal, kableExtra)
    
    table <- paired(
      time ~ signed.rank(hba1c) + ldl,
      data = data,
      id = patient_id,
      control = paired.control(digits = 2,
                               numeric.stats = c("Nmiss2", "meansd", "medianq1q3", "range"),
                               numeric.test = "signed.rank",
                               signed.rank.correct = FALSE))
    
    table %>%
      summary(text = TRUE) %>%
      kable(align = c("l", rep("r", 4))) %>%
      column_spec(column = 1, width_min = "4cm") %>%
      column_spec(column = 2:4, width_min = "3cm") %>%
      column_spec(column = 5, width_min = "2.5cm")
    baseline (N=30) 12-month (N=30) Difference (N=30) p value
    hba1c 0.730
    - N-Miss 0 0 0
    - Mean (SD) 8.45 (3.09) 8.76 (3.09) 0.31 (4.78)
    - Median (Q1, Q3) 7.81 (5.80, 11.30) 8.66 (6.05, 11.78) 0.29 (-2.44, 3.74)
    - Range 4.01 - 12.77 4.21 - 13.73 -7.58 - 9.12
    ldl 0.855
    - bad control 14 (46.7%) 16 (53.3%) 14 (100.0%)
    - good control 16 (53.3%) 14 (46.7%) 16 (100.0%)
    tests(table) %>%
    kable()
    Group Variable p.value Method
    time hba1c 0.7303420 Wilcoxon signed rank exact test
    time ldl 0.8551321 McNemar’s Chi-squared test with continuity correction

Analysis: Generalised Linear Models (GLM)

How are categorical variables represented?

  1. Categorical variables are characters, but we need a numeric matrix in order to perform computations to get the \(\beta\) coefficients
set.seed(10)
idx <- sample(1:344, 5)
penguins[idx,]
# A tibble: 5 × 9
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Dream               35.6          17.5               191        3175
2 Chinstrap Dream               50.7          19.7               203        4050
3 Adelie    Torgersen           39.7          18.4               190        3900
4 Gentoo    Biscoe              43.2          14.5               208        4450
5 Gentoo    Biscoe              47.2          13.7               214        4925
# ℹ 3 more variables: sex <fct>, year <int>, species_plus <chr>
  2. So one-hot encoding / dummy encoding will be applied to categorical variables
model.matrix(~species, penguins)[idx,]
    (Intercept) speciesChinstrap speciesGentoo
137           1                0             0
330           1                1             0
72            1                0             0
211           1                0             1
271           1                0             1

Framework

flowchart LR
  A[Model specification] --> B[Inference] --> C[Diagnostics]

  1. Model specification
    1. How to decide what type of model to fit?
    2. How to decide what variables to choose?
  2. Inference
    1. How do we get the value of the coefficients, confidence intervals, p-values?
  3. Diagnostics
    1. Does the model fit our data well?

Why the need for GLM?

  1. We want to find the association between an outcome (e.g. BMI) and some independent variables (e.g. age, gender, ethnicity, socioeconomic status)
  2. We are used to fitting a multivariable linear regression \[ \begin{align} \underbrace{BMI}_{\text{continuous}} &= \underbrace{\beta_0 + \beta_1age + \beta_2gender + \beta_3ethnicity + \cdots + \epsilon}_{\text{this linear combination is continuous}} \\ \epsilon &\sim N(0, \sigma^2) \end{align} \]
  3. What if your outcome is not continuous?
    1. E.g. outcome takes on values of 0 or 1 (logistic regression)
    2. E.g. outcome takes on discrete integer values 0, 1, 2, … (Poisson regression)
  4. GLM gives you a way to relate the outcome (which can take on various distributions) to a linear combination of independent variables

Formal definition of GLM

GLM consists of 2 components

Component 1: What distribution does your outcome, \(y_i\) take on?

\[\begin{align} y_i|X_i \sim \text{member of exponential family} \ \text{(e.g. $y_i$ is normally distributed)} \end{align}\]

  1. \(i\) indexes the \(i\)-th observation of your dataset i.e. corresponds to a particular row of your dataset
  2. \(y_i\) represents the \(i\)-th observation of your outcome variable
  3. \(X_i\) represents the \(i\)-th observation of all your independent variables (e.g. age, gender, ethnicity)

Component 2: What is the link function \(g\) you want to use?

\[\begin{align} g(\mathbb{E}(y_i|X_i)) = X_i^{T} \beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]

  1. \(g\) is the link function that maps \(\mathbb{E}(y_i|x_i)\) to \(X_i^{T}\beta\)
  2. It is the link function that enables the mapping of the continuous \(X_i^{T}\beta\) to \(y_i\), which can be discrete for example
  1. In a generalised linear model, the outcome variables \(y_1,...,y_n\) are modelled as independent and identically distributed (iid) observations

  2. The outcome variable is treated as a random variable (i.e. the outcome takes on a certain distribution, e.g. normal), but NOT the independent variables

Family member 1: Linear regression

When our outcome, \(y_i\) is continuous and takes on real values (i.e. \(y_i \in \mathbb{R}\)), we may choose to model \(y_i\) with a normal distribution

Component 1: \(y_i|x_i\) is normally distributed

\[ y_i | x_i \sim \mathcal{N}(X_i^{T}\beta, \sigma^2) \]

Component 2: link function is the identity function i.e. no transformation done

\[ \mathbb{E}(y_i|x_i) = X_i^{T}\beta = \beta_0 + \beta_1age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \]

Running a linear regression in R

  1. Notice the use of the formula syntax. LHS of the ~ is the dependent variable, while RHS are the independent variables
  2. Notice the family argument contains the 2 components of the GLM we discussed earlier
model <- 
  glm(body_mass_g ~ species + bill_length_mm +
                    bill_depth_mm + 
                    flipper_length_mm + sex,
      data = penguins,
      family = gaussian(link = "identity"))

summary(model)

Call:
glm(formula = body_mass_g ~ species + bill_length_mm + bill_depth_mm + 
    flipper_length_mm + sex, family = gaussian(link = "identity"), 
    data = penguins)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1460.995    571.308  -2.557 0.011002 *  
speciesChinstrap   -251.477     81.079  -3.102 0.002093 ** 
speciesGentoo      1014.627    129.561   7.831 6.85e-14 ***
bill_length_mm       18.204      7.106   2.562 0.010864 *  
bill_depth_mm        67.218     19.742   3.405 0.000745 ***
flipper_length_mm    15.950      2.910   5.482 8.44e-08 ***
sexmale             389.892     47.848   8.148 7.97e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 82563.33)

    Null deviance: 215259666  on 332  degrees of freedom
Residual deviance:  26915647  on 326  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 4723.9

Number of Fisher Scoring iterations: 2

Family member 2: Logistic regression

When our outcome, \(y_i\), is discrete and dichotomous (i.e. takes on two discrete values: 0/1, failure/success), we can model \(y_i\) with a Bernoulli distribution

Component 1: \(y_i|x_i\) is Bernoulli distributed

\[ y_i | x_i \sim \mathrm{Bernoulli}(\pi_i) \]

  • \(\pi_i\) is the probability of success of \(y_i\)
  • \(\mathbb{E}(y_i|x_i) = \pi_i\)

Component 2: Link function, \(g\), is the logit function

\[\begin{align} g(\mathbb{E}(y_i|x_i)) &= \underbrace{\mathrm{ln} \left( \frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)} \right)}_{logit[\mathbb{E}(y_i|x_i)]} = X_i^{T}\beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]

Running a logistic regression in R

  1. To be filled only after completion of exercise
  2. Code is largely similar to linear regression except for:
    1. family = binomial(link = "logit")

Inferring the coefficients

  1. Maximum likelihood estimation (MLE) is the way to infer the beta coefficients

  2. The idea behind MLE is that we want to maximise the joint probability of our observed data assuming that the data was generated according to our model i.e. we want to maximise the likelihood function \(\mathcal{L}(\beta) = P(y_1, y_2, ..., y_n)\).

  3. In the case of logistic regression, the likelihood function \(\mathcal{L}(\beta)\) is: \[\begin{align} \mathcal{L}(\beta) &= P(y_1, y_2, ..., y_n) \\ &= \prod_{i=1}^{n} P(y_i) \\ &= \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ \end{align}\]

  4. Convert the likelihood to the log-likelihood to make it easier to work with (we can do that because log is a monotonic function) \[\begin{align} \mathrm{ln}\ \mathcal{L}(\beta) &= \mathrm{ln} \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ &= \sum_{i=1}^{n} \left[ y_i\mathrm{ln}\pi_i + (1-y_i)\mathrm{ln}(1-\pi_i) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \mathrm{ln} \left( \frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) + (1-y_i) \mathrm{ln} \left( 1-\frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) + (1-y_i) \mathrm{ln} \left( \frac{1}{1+e^{\beta^{T}x_i}} \right) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) - (1-y_i) \mathrm{ln} (1 + e^{\beta^{T}x_i} ) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \end{align}\]

  5. In order to maximise the log-likelihood, we will use the iteratively reweighted least squares (IRLS) algorithm, a Newton-Raphson-like method \[\begin{align} \beta^{(t+1)} = \beta^{(t)} - \textit{H}^{-1}(\beta^{(t)}) \nabla_{\beta}\ \mathrm{ln}\ \mathcal{L}(\beta^{(t)}) \end{align}\] The idea behind this is to update \(\beta\) via a series of iterations such that on every update, we are stepping in the direction of steepest ascent.

  6. To use the IRLS, first we need to calculate the gradient of the log-likelihood \[\begin{align} \nabla_{\beta} \mathrm{ln}\ \mathcal{L}(\beta) &= \sum_{i=1}^{n} \left[ y_i \nabla_\beta \beta^{T}x_i - \nabla_\beta \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \\ &= \sum_{i=1}^{n} \left[ y_i x_i - \frac{e^{\beta^{T}x_i}}{1+e^{\beta^{T}x_i}} x_i \right] \\ &= \sum_{i=1}^{n} (y_i - \pi_i) x_i \\ &= X^{T}(y - \pi) \end{align}\]

  7. Next, we will calculate the Hessian, \(H\), of the log-likelihood by taking derivatives one more time from the gradient. \[\begin{align} H &= -\sum_{i=1}^{n} \pi_i (1-\pi_i) x_i x_i^{T} \\ &= - X^{T} W X \\ \end{align}\]

    where \(W\) is a diagonal weight matrix given by: \[\begin{align} W = diag\{\pi_1 (1-\pi_1), \pi_2 (1-\pi_2), ..., \pi_n (1-\pi_n)\} \end{align}\]

  8. Now that we have the components, we can substitute these back into (5) to expand it out \[\begin{align} \beta^{(t+1)} = \beta^{(t)} + (X^{T}W^{(t)}X)^{-1}X^{T}(y - \pi^{(t)}) \end{align}\]
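The update in step 8 can be implemented directly in a few lines of base R and checked against glm. A minimal sketch on simulated data (X, y, and the number of iterations are illustrative assumptions):

```r
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with intercept
beta_true <- c(-0.5, 1, -2)
y <- rbinom(n, 1, plogis(X %*% beta_true))   # simulated 0/1 outcome

beta <- rep(0, ncol(X))                      # initial guess
for (t in 1:25) {
  pi_hat <- plogis(X %*% beta)               # current fitted probabilities
  W <- diag(as.vector(pi_hat * (1 - pi_hat)))  # diagonal weight matrix
  # beta^(t+1) = beta^(t) + (X'WX)^{-1} X'(y - pi)
  beta <- beta + solve(t(X) %*% W %*% X, t(X) %*% (y - pi_hat))
}

# Compare against R's own IRLS inside glm()
cbind(irls = beta, glm = coef(glm(y ~ X - 1, family = binomial)))
```

The two columns should agree up to convergence tolerance, since glm itself fits by IRLS.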

Interpret coefficients: logistic regression

  1. Step 1: Start from the equation linking outcome to independent variables \[\begin{align} \mathrm{ln}\left(\underbrace{\frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)}}_{odds_i}\right) = \beta_0 &+ \beta_1 age_i + \beta_2 gender^{(male)}_i \\ &+ \beta_3 ethnicity^{(malay)}_i + \beta_4 ethnicity^{(indian)}_i + \cdots \end{align}\]

  2. Step 2: Analyse unit changes

    1. Continuous example: 1 unit increase in age, with everything else held constant. The exponent of \(\beta_1\) gives us the odds ratio for every unit increase in age. \[\begin{align} \mathrm{ln}\ odds_i|_{age=a+1} - \mathrm{ln}\ odds_i|_{age=a} &= \beta_1 (a+1) - \beta_1 (a) \\ \mathrm{ln}\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= \beta_1 \\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= e^{\beta_1} \end{align}\]

    2. Categorical example: compare to reference category, with everything else held constant. The exponent of \(\beta_3\) gives us the odds ratio of malay compared to chinese. \[\begin{align} \mathrm{ln}\ odds_i|_{ethnicity=malay} - \mathrm{ln}\ odds_i|_{ethnicity=chinese} &= \beta_3 \\ \mathrm{ln}\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= \beta_3 \\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= e^{\beta_3} \end{align}\]
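In R, both interpretations amount to exponentiating the fitted coefficients. A sketch (the model predicting penguin sex is an illustrative choice, not a recommended analysis):

```r
library(palmerpenguins)

# Logistic regression: penguins$sex is a factor, so the model gives
# the log-odds of the second level (male)
fit <- glm(sex ~ body_mass_g + bill_depth_mm,
           data = penguins, family = binomial(link = "logit"))

# e^beta = odds ratios, with Wald 95% confidence intervals
exp(cbind(OR = coef(fit), confint.default(fit)))
```

`confint.default` gives Wald intervals; profile-likelihood intervals via `confint` are an alternative.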

Family member 3: Gamma regression

When our outcome \(y_i\), is continuous and positive, we can model \(y_i\) with a Gamma distribution

Component 1: \(y_i\) is Gamma distributed

\[ y_i | X_i \sim Gamma(\alpha, \beta) \]

Component 2: One choice of the link function, \(g\), is the ln function

\[ \begin{align*} g(\mathbb{E}(y_i|X_i)) = \mathrm{ln}\ \mathbb{E}(y_i|X_i) = \beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \\ \end{align*} \]

Notes:

  1. The inverse of the ln function is the exponent function. Alternatively, we can also express \(\mathbb{E}(y_i|x_i) = e^{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}\)

  2. From this expression, we can see that the inverse of the link function (i.e. exponent) maps \(X_i^{T}\beta\) to \(\mathbb{E}(y_i|X_i)\). And we know that this is a valid function because \(\mathbb{E}(y_i|X_i)=e^{X_i^{T}\beta} > 0\)

Interpret coefficients: gamma regression

  1. Step 1: Start from the equation linking outcome to independent variables \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i) = \beta_0 &+ \beta_1 age_i + \beta_2 gender^{(male)}_i \\ &+\beta_3 ethnicity^{(malay)}_i + \beta_4 ethnicity^{(indian)}_i + \cdots \end{align}\]
  2. Step 2: Analyse unit changes
    1. Continuous example: 1 unit increase in age, with everything else held constant. The percentage change in mean outcome for every unit increase in age can be given from \(e^{\beta_1} - 1\) \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a+1} - \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a} &= \beta_1 (a+1) - \beta_1(a) \\ \mathrm{ln}\ \frac{\mathbb{E}(y_i|X_i)|_{age=a+1}}{ \mathbb{E}(y_i|X_i)|_{age=a}} &= \beta_1 \\ \frac{\mathbb{E}(y_i|X_i)|_{age=a+1}}{ \mathbb{E}(y_i|X_i)|_{age=a}} -1 &= e^{\beta_1} - 1 \end{align}\]
    2. Interaction terms example: compare vs reference category, age at some value \(a\), with everything else held constant
      1. The percentage change in mean outcome for male compared to female can be given from \(e^{\beta_2 + \beta_3 a} - 1\)
      2. Notice now that this percentage change which was \(\beta_2\) (if no interaction had been defined) is now modified by \(\beta_3 a\), which depends on age \[\begin{align} \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a, gender=male} - \mathrm{ln}\ \mathbb{E}(y_i|X_i)|_{age=a, gender=female} &= (\beta_1 a + \beta_2 + \beta_3 a) - \beta_1 a \\ \mathrm{ln}\ \frac{\mathbb{E}(y_i|X_i)|_{age=a, gender=male}}{ \mathbb{E}(y_i|X_i)|_{age=a, gender=female}} &= \beta_2 + \beta_3 a \\ \frac{\mathbb{E}(y_i|X_i)|_{age=a, gender=male}}{ \mathbb{E}(y_i|X_i)|_{age=a, gender=female}} -1 &= e^{\beta_2 + \beta_3 a} - 1 \end{align}\]
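A minimal sketch of a Gamma GLM with log link, using penguin body mass (positive and continuous) as an illustrative outcome; the covariates are arbitrary choices:

```r
library(palmerpenguins)

# Gamma regression with log link
fit <- glm(body_mass_g ~ flipper_length_mm + sex,
           data = penguins, family = Gamma(link = "log"))
summary(fit)

# e^beta - 1: percentage change in the mean outcome per unit increase
round(100 * (exp(coef(fit)) - 1), 2)
```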

Family member 4: Conway-Maxwell-Poisson (CMP) regression

  1. When our outcome variable is a “count”, the Poisson regression model is used because the Poisson distribution models counts better than, say, linear regression, which assumes that the outcome variable is normally distributed
  2. However, the Poisson regression model is restrictive as it assumes equi-dispersion in our outcome variable (i.e. the mean of the outcome variable approximately equals its variance). When there is under- or over-dispersion, the Poisson distribution no longer models the outcome variable well
  3. When there is over-dispersion (variance of outcome > mean of outcome), negative binomial regression is often used
  4. When there is under-dispersion (variance of outcome < mean of outcome), we will use the Conway-Maxwell-Poisson regression model
  5. Useful resources: (i) video explanations, see1,2 (ii) Sellers & Premeaux3 explain the CMP regression and survey available statistical packages

CMP distribution

  1. The Conway-Maxwell-Poisson distribution differs from the Poisson distribution with the introduction of a parameter \(\nu\)
  2. The \(\nu\) parameter enables the Conway-Maxwell-Poisson distribution to model a wide range of dispersion, and generalises well-known distributions
    1. When \(\nu = 1\), it takes on the Poisson distribution (equi-dispersion)
    2. When \(\nu = 0\) and \(\lambda < 1\), it takes on the geometric distribution (over-dispersion)
    3. When \(\nu \rightarrow \infty\), it approaches the Bernoulli distribution (under-dispersion)

CMP regression

  1. Use of the Conway-Maxwell-Poisson (CMP) distribution in the generalised linear model (GLM) setting:
    1. In the original CMP formulation of Sellers and Shmueli4, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by the equations:
      1. \(ln(\lambda_{i}) = X_{i}^T\beta\)
      2. \(\mathbb{E}(Y_{i}) = \lambda_{i} \frac{\partial\, \mathrm{ln}\, Z(\lambda_{i}, \nu)}{\partial \lambda_i} \approx \lambda_{i}^{\frac{1}{\nu}} - \frac{\nu - 1}{2\nu}\), where \(Z(\lambda_i, \nu)\) is a normalising constant
  2. However, the CMP regression model is not amenable to interpretation of coefficients by mean contrasts, because \(\lambda_i\) is related to \(\mathbb{E}(Y_i)\) by a non-linear function
  3. To overcome this, Huang 20175 reparameterised the CMP distribution to model the mean directly. He called it CMP\(\mu\)

CMP\(\mu\)-regression

  1. Under the CMP\(\mu\)-regression formulation, the relationship between dependent variable \(Y\) and independent variables \(X\) is given by \(ln(\mathbb{E}(Y_i|X)) = X_{i}^T\beta\)
  2. The convenient thing about CMP-regression and CMP\(\mu\)-regression is that we can conduct a likelihood ratio test4,5 to see if a Poisson regression model (Poisson regression is a special case of CMP-regression when \(\nu = 1\)) is adequate (i.e. \(H_0: \nu = 1\) vs \(H_1: \nu \ne 1\))
  3. Note that \(H_1\) does not specify the direction (under vs over) of data dispersion. This can be assessed via the dispersion estimate \(\hat{\nu}\)4 (i.e. \(\hat{\nu} > 1\) if under-dispersion and \(\hat{\nu} < 1\) if over-dispersion)
  4. Model diagnostics5: If the underlying distributional model is correct, the probability inverse transformation (PIT) should resemble a random sample from a standard uniform distribution. Goodness-of-fit can then be assessed by plotting a histogram of the PIT, or via a quantile plot of the PIT against the uniform distribution

Running a CMP\(\mu\)-regression in R

We will use the mpcmp6 R package’s implementation of the CMP\(\mu\)-regression
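A minimal sketch, assuming mpcmp is installed; the glm.cmp call and the attendance example data follow the package’s documentation (verify against the current version before relying on this):

```r
library(mpcmp)

data(attendance)  # example count data bundled with mpcmp
fit <- glm.cmp(daysabs ~ gender + math + prog, data = attendance)
summary(fit)      # coefficients on the log-mean scale, plus the dispersion estimate
```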

Analysis: Concordance

What is the Kappa statistic?

  1. The Kappa statistic is a measure of “true” agreement. It indicates the proportion of agreement beyond that expected by chance7
  2. If the prevalence index is high, chance agreement is higher and kappa is reduced
  3. If the bias index is high, chance agreement is lower and kappa is higher

Calculation of Kappa statistic

  1. Calculation of prevalence-adjusted, bias-adjusted Kappa can be done through the epiR package
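A hedged sketch with epiR’s epi.kappa on a hypothetical 2×2 agreement table (the counts are made up; check the function’s help page for the exact components of its return value):

```r
library(epiR)

# Hypothetical agreement table: rows = rater 1 (yes/no), cols = rater 2 (yes/no)
dat <- as.table(matrix(c(30, 10, 5, 55), nrow = 2))

res <- epi.kappa(dat, method = "cohen", conf.level = 0.95)
res$kappa  # Cohen's kappa with confidence interval
res$pabak  # prevalence-adjusted, bias-adjusted kappa
```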

Analysis: Tests of reliability

Internal Consistency

  1. Cronbach’s alpha is a measure of the internal consistency of a questionnaire
  2. Interpreting Cronbach’s alpha
    1. Cronbach’s alpha between 0.70 to 0.80 is considered good enough8
    2. Cronbach’s alpha is shown as raw_alpha in the output
  3. Caveats:
    1. All else equal, alpha will increase as the number of items on the scale increases8 (page 799, section 17.8.2)
    2. If responses lack variability (i.e. almost everyone gives the same response), alpha tends to be poor8 (page 803)
    3. “If alpha is too high it may suggest that some items are redundant as they are testing the same question but in a different guise. A maximum alpha value of 0.90 has been recommended”9
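The raw_alpha mentioned above is produced by psych::alpha. A minimal sketch on simulated 5-point Likert items (a real questionnaire’s item columns would replace `items`):

```r
library(psych)

set.seed(2024)
# Simulated responses of 100 respondents to 4 Likert items
items <- data.frame(replicate(4, sample(1:5, size = 100, replace = TRUE)))

res <- alpha(items, check.keys = TRUE)  # check.keys flags reverse-scored items
res$total$raw_alpha                     # Cronbach's alpha (raw_alpha)
```

Since these simulated items are uncorrelated, the resulting alpha will be poor, which itself illustrates caveat 2 above.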

Test-retest reliability

  1. We will use the Intraclass Correlation Coefficient to assess for test-retest reliability
  2. Out of the 10 variants of ICC, we will compute the “two-way mixed effects, absolute agreement, single rater/measurement” ICC variant or ICC (A,1) of McGraw and Wong10 per recommendations from11,12:
  3. When it comes to interpreting the ICC, according to Koo et al.11:
    1. ICC less than 0.5: poor reliability
    2. ICC between 0.5 and 0.75: moderate reliability
    3. ICC between 0.75 and 0.90: good reliability
    4. ICC greater than 0.90: excellent reliability
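A sketch of computing this ICC variant with the irr package’s icc function (two-way model, absolute agreement, single unit, matching ICC(A,1)); the test-retest data below are simulated:

```r
library(irr)

set.seed(2024)
test   <- rnorm(30, mean = 50, sd = 10)  # first administration
retest <- test + rnorm(30, sd = 3)       # second administration

# Two-way model, absolute agreement, single measurement: ICC(A,1)
icc(cbind(test, retest), model = "twoway", type = "agreement", unit = "single")
```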

Analysis: Comparative effectiveness research using observational data

What is comparative effectiveness research?

  1. Comparative effectiveness research is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition or to improve the delivery of care13
  2. Effectiveness vs efficacy
    1. Effectiveness refers to the performance of an intervention in “real-world” clinical settings while efficacy refers to the performance of an intervention under “ideal” circumstances13
  3. Examples:
    1. Compare the effectiveness of different treatment options14
    2. Compare the effectiveness of a health intervention programme vs usual care group15

Target trial framework

  1. Suppose you are interested in conducting a pragmatic randomised controlled trial (RCT) to evaluate your health intervention vs usual care
  2. However, conducting this trial is infeasible so we resort to emulating this trial with observational data
  3. The target trial framework16,17 gives us guidelines to conceptualise our study
  4. An example of the application of the target trial framework18 can be found in Table III of Cote et al.

Target trial framework pointers

  1. Extra pointers adapted from16,19
Protocol component Observational study that emulates the RCT
Eligibility criteria • Apply the same inclusion/exclusion criteria
• Assuming that we have variables needed for inclusion/exclusion in our observational dataset
• Eligibility should only be determined based on info available at baseline (cf. immortal time bias)
• Selection bias can arise if we exclude patients with missing data (the analogue of dropout/loss to follow-up in a trial) and missingness is associated with both intervention and outcome
• Patients could be eligible at multiple time points. We have a choice of (1) single eligible time: set baseline as the first timepoint when eligibility is met within a time period or (2) all eligible times (or large subset): emulate multiple nested trials
Treatment strategies • Assign patients to strategy consistent with their baseline data
• Ensure treatment arms are consistent e.g. both arms contain only new initiators of a treatment as opposed to one arm having patients who are already receiving treatment some time before baseline, which affects the outcome
• Positivity: “each individual should have positive probability of receiving each level of exposure for every combination of covariates” (i.e. a patient should not be precluded from a level of exposure to the intervention at the outset)
• Consistency: “there cannot be two versions of a treatment that would result in different outcomes”
Assignment procedures • Can only emulate trials without blind assignment because individuals/care providers in the dataset are aware of the received treatments
• Patients are not randomly assigned to treatment strategies. Use methods to achieve exchangeability at baseline (e.g. propensity score matching, inverse probability weighting) and g-methods to deal with time-varying confounding
• Limitation: residual confounding from unmeasured confounders
Follow-up period
Outcome
Causal contrast of interests • Intention-to-treat effect: we would use patients who were prescribed medication for treatment initiation
• Per-protocol effect: we would use patients who adhered to the treatment. Adjustment for postbaseline confounding is needed because postbaseline prognostic factors associated with subsequent adherence to treatment may be affected by prior adherence
Analysis plan • Proper definition of baseline/time zero is important (affects assessment of eligibility)
• The best way to define time zero is the time when an eligible individual initiates a treatment strategy

Propensity score matching

  1. What is a propensity score?
  2. What is the purpose of propensity score matching?
  3. How is propensity score estimated?
  4. How is propensity score matching done?

How do we choose matching variables?

How do we evaluate covariate balance after matching?

  1. We make use of the absolute standardized mean difference (SMD) metric
  2. Rule of thumb
  3. p-values are not used because
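A hedged sketch of matching and balance checking with the MatchIt package (using its bundled lalonde data; the covariates are illustrative choices). summary() reports standardized mean differences before and after matching:

```r
library(MatchIt)

data("lalonde", package = "MatchIt")

# Nearest-neighbour matching on a logistic-regression propensity score
m <- matchit(treat ~ age + educ + race + married,
             data = lalonde, method = "nearest", distance = "glm")

summary(m)  # "Std. Mean Diff." columns show covariate balance
```

A common (not universal) rule of thumb is to aim for absolute SMDs below 0.1 after matching.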

Analysis: Intervention effect estimation

Analysis: Missing data

Outline

  1. Why do we care about missing data?
  2. What are the types of missingness mechanisms?
  3. What are the implications of each missingness mechanism?
  4. Multiple imputation as a method for imputing missing data

Motivation: why do we care about missing data?

  1. Commonly used statistical models (e.g. GLM) require complete cases \(\rightarrow\) forced to discard data points
  2. Data is expensive to collect
  3. In throwing out data, you will face the problem of:
    1. Efficiency: the number of data points is reduced \(\rightarrow\) lose statistical power
    2. Bias: you might systematically exclude a population group and end up with a biased dataset
    3. Both bias and efficiency will influence your final conclusion

Missingness mechanism

  1. A missingness mechanism refers to the underlying process which generated the missing data
  2. There are 3 missingness mechanisms
    1. Missing completely at random (MCAR)
    2. Missing at random (MAR)
    3. Missing not at random (MNAR)

Framework for understanding missingness mechanism

  1. We will use the framework as laid out in Mohan & Pearl20
  2. Reason: Clearest way to discriminate between MCAR, MAR, MNAR
  3. Prerequisite: This requires an understanding of the concept of d-separation in graphical models

Framework notation

Suppose we have data in the following format20

  • \(A\): Age
  • \(G\): Gender
  • \(O\): Obesity
  • \(O^*\): Proxy variable for obesity
  • \(R_O\): Missing indicator
  1. Fully observed variables are shaded (e.g. \(A\), \(G\))
  2. Partially observed variables (with missing data) are non-shaded (e.g. \(O\))
  3. To model the missingness process, two variables are introduced:
    1. \(R_O\) - A binary {0, 1} masking variable that governs missingness
    2. \(O^*\) - A proxy variable for obesity that is fully observed \[ O^* = f(R_O, O) = \begin{cases} O & \text{if } R_O = 0 \\ m & \text{if } R_O = 1 \end{cases} \]

#1 Missing completely at random (MCAR)

  1. We say that the missing obesity data is MCAR if the cause of missingness is independent of any other variable (There are no edges between \(R_O\) and \(G, A, O\))
  2. Examples of how Obesity could be MCAR:
    1. Technical fault: e.g. random connectivity downtime in the weighing machine that prevents the sending of the patient’s weight to the EMR
    2. Workflow: patient forgets to take his/her weight after questionnaire (and we could not call the patient to get his/her weight after)

Implications of MCAR

  1. Complete-case analysis \(\rightarrow\) unbiased dataset, though loss of statistical power due to fewer \(N\)
  2. Multiple imputation \(\rightarrow\) unbiased dataset
    1. Multiple imputation attempts to impute missing variables using a model of the partially observed variable (\(Y\)) on the observed variables (\(X\))
    2. Why unbiased? \(R_O \perp\!\!\!\!\!\!\perp O\) implies that \(P(O | A, G, R_O = 1) = P(O | A, G, R_O = 0)\)
    3. The distribution of missing obesity values (which we are trying to impute) is the same as the distribution of observed obesity values, given the other observed variables like age & gender
    4. So we can estimate the missing obesity values using a logistic regression model

#2 Missing at random (MAR)

  1. We say that the missing Obesity data is MAR if the cause of missingness is dependent on fully observed variable(s), e.g. Age
  2. Conditional independence assertion:
    1. Missingness, \(R_O\), is conditionally independent of Obesity given Age i.e. \(R_O \perp\!\!\!\!\!\!\perp O | A\)
    2. Or you can think of it as: missingness, \(R_O\), is only dependent on Age
  3. Example of how Obesity could be MAR:
    1. A lot of missing weight data among teenagers who were rebellious and did not report their weight

Implications of MAR

  1. Complete-case analysis \(\rightarrow\) biased estimates
    1. Why biased? Data from teenage patients are excluded
  2. Multiple imputation \(\rightarrow\) unbiased estimates
    1. Why unbiased? Data from teenage patients are imputed rather than excluded
    2. The conditional independence assertion above implies that \(P(O | A, G, R_O = 1) = P(O | A, G, R_O = 0)\)
    3. We can impute the missing data because the distribution of missing obesity values (which we are trying to impute) is the same as the distribution of observed obesity values, given the other observed variables like age & gender

#3 Missing not at random (MNAR)

  1. We say that the missing Obesity data is MNAR if the cause of missingness is dependent on the Obesity variable itself
  2. Example of how Obesity could be MNAR:
    1. A lot of missing weight data from obese patients who are embarrassed and did not disclose their weight
    2. In another setting, we often find that low/high income earners do not disclose their income

Implications of MNAR

  1. Complete-case analysis \(\rightarrow\) biased estimates
    1. Why biased? Data from obese patients are excluded
  2. Multiple imputation \(\rightarrow\) biased estimates
    1. Why biased? The distribution of missing obesity values (which we are trying to impute) differs from the distribution of observed obesity values, given the other observed variables like age & gender, i.e. \(P(O | A, G, R_O = 1) \ne P(O | A, G, R_O = 0)\)
    2. Imputing missing weight data for obese patients using data from patients who are not obese \(\rightarrow\) imputed weights tend to be lower than they actually are
  3. Imputation of missing data requires causal modelling of the \(R\) variables; no purely statistical method of imputation exists [20]
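The three mechanisms can be contrasted with a small base-R simulation (toy parameters, invented for illustration), comparing the complete-case mean of a continuous weight variable whose true mean is 70:

```r
set.seed(42)
n      <- 1e5
age    <- rnorm(n, mean = 40, sd = 10)
weight <- rnorm(n, mean = 70 + 0.2 * (age - 40), sd = 10)  # true mean 70

# Helper: mask values with probability p_miss (vectorised)
mask <- function(x, p_miss) ifelse(runif(n) < p_miss, NA, x)

w_mcar <- mask(weight, 0.3)                               # independent of everything
w_mar  <- mask(weight, plogis(-1 + 0.1 * (age - 40)))     # depends on observed age only
w_mnar <- mask(weight, plogis(-1 + 0.1 * (weight - 70)))  # depends on weight itself

# Complete-case means: ~70 under MCAR; shifted under MAR (weight tracks age);
# clearly pulled down under MNAR (heavier patients go missing more often)
c(MCAR = mean(w_mcar, na.rm = TRUE),
  MAR  = mean(w_mar,  na.rm = TRUE),
  MNAR = mean(w_mnar, na.rm = TRUE))
```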

Can these missingness mechanisms be tested for?

  1. MAR and MNAR cannot be tested for (see section 1.4 of Enders [21])
    1. In each case, an independence assertion regarding the missing variable was made
    2. Since the missing variable is by definition partially unobserved, we cannot test such an assertion when part of the test’s input is unobserved
  2. MCAR can be tested for (see section 1.9 of Enders [22])
    1. There are testable propositions: the idea is to assume that the cases with missing data came from the same population as the complete cases, and thus have the same means and covariances
    2. Little’s test is one test for MCAR; it evaluates mean differences across subgroups of cases that share the same missing data pattern

Caveats about Little’s test

From section 1.9 of Enders [22]

  1. The test does not identify the specific variables that violate MCAR, so it is only useful for testing an omnibus hypothesis that is unlikely to hold in the first place
  2. The version of the test outlined above assumes that the missing data patterns share a common covariance matrix. MAR and MNAR mechanisms can produce missing data patterns with different variances and covariances, and the test statistic in Equation 1.4 would not necessarily detect covariance-based deviations from MCAR
  3. Simulation studies suggest that Little’s test suffers from low power, particularly when the number of variables that violate MCAR is small, the relationship between the data and missingness is weak, or the data are MNAR. Consequently, the test has a propensity to produce Type II errors and can lead to a false sense of security about the missing data mechanism

Multiple imputation by chained equations (MICE)
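The section title refers to the mice package, which implements MICE. A minimal impute-analyse-pool sketch (the model formula is chosen for illustration only):

```r
pacman::p_load(mice, palmerpenguins)

# 1. Impute: create m = 5 completed datasets (default methods per column type)
imp <- mice(penguins, m = 5, seed = 123, printFlag = FALSE)

# 2. Analyse: fit the substantive model on each completed dataset
fits <- with(imp, lm(body_mass_g ~ bill_length_mm + sex))

# 3. Pool: combine the m sets of estimates using Rubin's rules
summary(pool(fits))
```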

Conclusion

  1. We have learnt about the 3 missingness mechanisms of MCAR, MAR, MNAR
  2. We have considered the implications of complete-case analysis and multiple imputation on bias & efficiency
  3. Challenge remains: a thorough handling of missing data requires causal modelling in order to:
    1. Distinguish between the 3 missingness mechanisms
    2. Impute missing data in the MNAR scenario, where common approaches like complete-case analysis or multiple imputation are invalid

Presentation of results

Markdown

  1. Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages.1
  2. Try out the markdown live preview

R Markdown

  1. R Markdown is an extension of markdown that enables users to execute R code and display the code output alongside text
  2. The knitr package executes the R code chunks and writes the output into a temporary markdown document
  3. Pandoc then converts the markdown document to the desired format (e.g. html, pdf, docx, pptx)
  4. See summary by Robin Linacre
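In practice, both steps (knitr, then Pandoc) are wrapped by a single call to `rmarkdown::render`; the file name below is a hypothetical example:

```r
# Knits report.Rmd and converts it to html in one step
rmarkdown::render("report.Rmd", output_format = "html_document")
```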

Image from R Markdown cookbook chapter 2.1

[DEMO] Creation of an R markdown document

  1. Create an rmarkdown document

  2. Knit your document into html

Components of Rmarkdown document

  1. YAML header: controls the metadata e.g. title, date, output document format, table of contents
  2. Formatted text of your descriptions, explanations etc.
  3. Code chunks: where you run R code e.g. plot graphs and display the output in the same document

Image taken from An Introduction to R
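A minimal .Rmd skeleton showing all three components (the contents are illustrative):

````markdown
---
title: "Minimal example"      # 1. YAML header
output: html_document
---

Some formatted **text** explaining the analysis.   <!-- 2. formatted text -->

```{r pressure-plot}
plot(pressure)                # 3. a code chunk: runs and embeds its output
```
````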

R markdown YAML header

---
title: "My project"
author: "CRU"
date: '`r format(Sys.Date(), "%d %b %Y")`'
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
bibliography: references.bib
csl: vancouver-superscript.csl 
link-citations: true
---
  1. This is a common setup which I use
  2. bookdown::html_document2 of the Bookdown package is a good output format because it takes care of formatting, e.g. auto-numbering of sections, handling of references, and the appendix
  3. toc (lines 7 & 8) refers to the table of contents
  4. bibliography, csl and link-citations are for adding citations stored in a bibtex file to your document. More will be covered in a later slide

Consort diagram

  1. The consort package provides a convenient consort_plot function to plot CONSORT diagrams for reporting inclusion/exclusion/allocation (see the vignette)

    library(consort)
    flow_diagram <- consort_plot(data = penguins %>%
                                          mutate(id = row_number(),
                                                 exclusion = case_when(island == "Dream" ~ "Penguins on Dream island",
                                                                       year == 2007 ~ "Data collected in 2007",
                                                                       TRUE ~ NA_character_) %>%
                                                             fct_relevel("Penguins on Dream island",
                                                                         "Data collected in 2007")),
                                 orders = c(id = "Study population",
                                            exclusion = "Excluded",
                                            id = "Data for analysis"),
                                 side_box = "exclusion",
                                 cex = 1.2)
    plot(flow_diagram)

  2. When presenting in html documents, the plot function sometimes does not render the plot fully; when that happens, save it as an image first

    # Change the file path to where you want your image to be saved
    ggsave(here("images/consort_diagram.png"), plot = build_grid(flow_diagram))
  3. Next, load your image using knitr::include_graphics

    # You can use the fig.height or fig.width chunk option to rescale your image
    knitr::include_graphics(here("images/consort_diagram.png"))

Presenting tables

pacman::p_load(arsenal, knitr)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
summary(text = TRUE) %>%
knitr::kable(align = c("l", rep("r", 5))) # the kable function converts the table object to html
|                   | Adelie (N=152)       | Chinstrap (N=68)     | Gentoo (N=124)       | Total (N=344)        | p value |
|:------------------|---------------------:|---------------------:|---------------------:|---------------------:|--------:|
| sex               |                      |                      |                      |                      | 0.976   |
| - N-Miss          | 6                    | 0                    | 5                    | 11                   |         |
| - female          | 73 (50.0%)           | 34 (50.0%)           | 58 (48.7%)           | 165 (49.5%)          |         |
| - male            | 73 (50.0%)           | 34 (50.0%)           | 61 (51.3%)           | 168 (50.5%)          |         |
| island            |                      |                      |                      |                      | < 0.001 |
| - Biscoe          | 44 (28.9%)           | 0 (0.0%)             | 124 (100.0%)         | 168 (48.8%)          |         |
| - Dream           | 56 (36.8%)           | 68 (100.0%)          | 0 (0.0%)             | 124 (36.0%)          |         |
| - Torgersen       | 52 (34.2%)           | 0 (0.0%)             | 0 (0.0%)             | 52 (15.1%)           |         |
| bill_length_mm    |                      |                      |                      |                      | < 0.001 |
| - N-Miss          | 1                    | 0                    | 1                    | 2                    |         |
| - Mean (SD)       | 38.79 (2.66)         | 48.83 (3.34)         | 47.50 (3.08)         | 43.92 (5.46)         |         |
| - Median (Q1, Q3) | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) |         |
| - Range           | 32.10 - 46.00        | 40.90 - 58.00        | 40.90 - 59.60        | 32.10 - 59.60        |         |

Formatting HTML tables

The kableExtra package provides useful functions to format tables e.g. changing column width, changing colors, font size, fixed header rows etc.

pacman::p_load(arsenal, knitr, kableExtra)

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "4.75cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
|                   | Adelie (N=152)       | Chinstrap (N=68)     | Gentoo (N=124)       | Total (N=344)        | p value |
|:------------------|---------------------:|---------------------:|---------------------:|---------------------:|--------:|
| sex               |                      |                      |                      |                      | 0.976   |
| - N-Miss          | 6                    | 0                    | 5                    | 11                   |         |
| - female          | 73 (50.0%)           | 34 (50.0%)           | 58 (48.7%)           | 165 (49.5%)          |         |
| - male            | 73 (50.0%)           | 34 (50.0%)           | 61 (51.3%)           | 168 (50.5%)          |         |
| island            |                      |                      |                      |                      | < 0.001 |
| - Biscoe          | 44 (28.9%)           | 0 (0.0%)             | 124 (100.0%)         | 168 (48.8%)          |         |
| - Dream           | 56 (36.8%)           | 68 (100.0%)          | 0 (0.0%)             | 124 (36.0%)          |         |
| - Torgersen       | 52 (34.2%)           | 0 (0.0%)             | 0 (0.0%)             | 52 (15.1%)           |         |
| bill_length_mm    |                      |                      |                      |                      | < 0.001 |
| - N-Miss          | 1                    | 0                    | 1                    | 2                    |         |
| - Mean (SD)       | 38.79 (2.66)         | 48.83 (3.34)         | 47.50 (3.08)         | 43.92 (5.46)         |         |
| - Median (Q1, Q3) | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) |         |
| - Range           | 32.10 - 46.00        | 40.90 - 58.00        | 40.90 - 59.60        | 32.10 - 59.60        |         |

Formatting HTML table subheaders

If you added labels to your dataframe using the labelled package, they can be presented as subheaders via the labelTranslations argument of summary

labels <- list(sex = "Sex",
               island = "Island",
               bill_length_mm = "Bill length (mm)")
labelled::var_label(penguins) <- labels

tableby(species ~ sex + island + bill_length_mm,
        data = penguins,
        control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
                                  digits = 2)) %>%
summary(text = TRUE, labelTranslations = map(labelled::var_label(penguins), ~ str_glue("**{.x}**"))) %>% # the ** renders the text bold in Rmarkdown html documents
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "5.5cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
|                      | Adelie (N=152)       | Chinstrap (N=68)     | Gentoo (N=124)       | Total (N=344)        | p value |
|:---------------------|---------------------:|---------------------:|---------------------:|---------------------:|--------:|
| **Sex**              |                      |                      |                      |                      | 0.976   |
| - N-Miss             | 6                    | 0                    | 5                    | 11                   |         |
| - female             | 73 (50.0%)           | 34 (50.0%)           | 58 (48.7%)           | 165 (49.5%)          |         |
| - male               | 73 (50.0%)           | 34 (50.0%)           | 61 (51.3%)           | 168 (50.5%)          |         |
| **Island**           |                      |                      |                      |                      | < 0.001 |
| - Biscoe             | 44 (28.9%)           | 0 (0.0%)             | 124 (100.0%)         | 168 (48.8%)          |         |
| - Dream              | 56 (36.8%)           | 68 (100.0%)          | 0 (0.0%)             | 124 (36.0%)          |         |
| - Torgersen          | 52 (34.2%)           | 0 (0.0%)             | 0 (0.0%)             | 52 (15.1%)           |         |
| **Bill length (mm)** |                      |                      |                      |                      | < 0.001 |
| - N-Miss             | 1                    | 0                    | 1                    | 2                    |         |
| - Mean (SD)          | 38.79 (2.66)         | 48.83 (3.34)         | 47.50 (3.08)         | 43.92 (5.46)         |         |
| - Median (Q1, Q3)    | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) |         |
| - Range              | 32.10 - 46.00        | 40.90 - 58.00        | 40.90 - 59.60        | 32.10 - 59.60        |         |

Cross-references & citations

  1. See cross-references and citations on how to add internal cross-references and citations from a bibtex file
  2. See bibliographies and citations on how to add a references and appendix section

Reflection questions

  • Can you provide a brief overview of your research and its main objectives?
  • What was your rationale for choosing this particular research topic?
  • How did you develop your research questions or hypotheses?
  • What gaps in the literature is your study intended to address?
  • Can you explain your chosen methodology and why it was appropriate for your study?
  • What were the main challenges you encountered during your research, and how did you overcome them?
  • How does your research contribute to the existing body of knowledge in your field?
  • Can you discuss the key theories or concepts that informed your research?
  • Were there any conflicting views in the literature, and how did you address them?
  • How did you ensure that your literature review was comprehensive and up-to-date?
  • Can you explain your data collection process and any ethical considerations?
  • What were the main findings of your research?
  • How did you analyze your data, and why did you choose those specific analytical methods?
  • How did you choose the variables in your model?
  • Were there any unexpected results, and how did you interpret them?
  • How do your findings relate to previous research in the field?
  • What are the practical implications of your research?
  • How might your findings be applied in real-world settings?
  • What are the limitations of your study, and how might they be addressed in future research?
  • If you were to continue this research, what would be your next steps?
  • Looking back, what would you do differently if you were to conduct this research again?
  • How has this research process contributed to your development as a researcher?
  • Can you identify any potential biases in your research and how you addressed them?
  • How do you see your research fitting into the broader context of your field?
  • Can you elaborate on any assumptions you made during your research?
  • How confident are you in your conclusions, and what evidence supports them?

References

1. VideoLecturesChannel. A Flexible Model for Count Data: The COM-Poisson Distribution. 2012.
2. Consortium R. Observation driven Conway-Maxwell Poisson count data models. 2018.
3. Sellers KF, Premeaux B. Conway–Maxwell–Poisson regression models for dispersed count data. WIREs Computational Statistics. 2020;13(6):e1533.
4. Sellers KF, Shmueli G. A flexible regression model for count data. The Annals of Applied Statistics. 2010;4(2):943–61.
5.
6. Fung T, Alwan A, Wishart J, Huang A. Mpcmp: Mean-Parametrized Conway-Maxwell Poisson Regression. 2020.
7. Sim J, Wright CC. The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Physical Therapy. 2005 Mar;85(3):257–68.
8. Field A, Miles J, Field Z. Discovering Statistics Using R. SAGE Publications; 2012.
9. Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ [Internet]. 2011/06/27 ed. 2011 Jun;2:53–5. Available from: https://www.ncbi.nlm.nih.gov/pubmed/28029643
10. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1(1):30–46.
11. Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med [Internet]. 2016/06/23 ed. 2016 Jun;15(2):155–63. Available from: https://www.ncbi.nlm.nih.gov/pubmed/27330520
12. Qin S, Nelson L, McLeod L, Eremenco S, Coons SJ. Assessing test-retest reliability of patient-reported outcome measures using intraclass correlation coefficients: Recommendations for selecting and documenting the analytical formula. Qual Life Res [Internet]. 2018/12/14 ed. 2019 Apr;28(4):1029–33. Available from: https://www.ncbi.nlm.nih.gov/pubmed/30547346
13. Gatsonis C, Morton SC. Methods in Comparative Effectiveness Research. Taylor & Francis; 2015.
14. Jaksa A, Gibbs L, Kent S, Rowark S, Duffield S, Sharma M, et al. Using primary care data to assess comparative effectiveness and safety of apixaban and rivaroxaban in patients with nonvalvular atrial fibrillation in the UK: An observational cohort study. BMJ Open [Internet]. 2022/10/18 ed. 2022 Oct;12(10):e064662. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36253039
15. Lu AD, Gunzburger E, Glorioso TJ, Smith WB 2nd, Kenney RR, Whooley MA, et al. Impact of Longitudinal Virtual Primary Care on Diabetes Quality of Care. J Gen Intern Med [Internet]. 2021/01/24 ed. 2021 Sep;36(9):2585–92. Available from: https://www.ncbi.nlm.nih.gov/pubmed/33483815
16. Hernan MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. Am J Epidemiol [Internet]. 2016/03/20 ed. 2016 Apr;183(8):758–64. Available from: https://www.ncbi.nlm.nih.gov/pubmed/26994063
17. Hernan MA, Sauer BC, Hernandez-Diaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol [Internet]. 2016/05/31 ed. 2016 Nov;79:70–5. Available from: https://www.ncbi.nlm.nih.gov/pubmed/27237061
18. Cote P, Negrini S, Donzelli S, Kiekens C, Arienti C, Ceravolo MG, et al. Introduction to target trial emulation in rehabilitation: A systematic approach to emulate a randomized controlled trial using observational data. Eur J Phys Rehabil Med [Internet]. 2024/02/29 ed. 2024 Feb;60(1):145–53. Available from: https://www.ncbi.nlm.nih.gov/pubmed/38420907
19. Wang C, Kaufman JS, Steele RJ, Shrier I. Target trial framework for determining the effect of changes in training load on injury risk using observational data: A methodological commentary. BMJ Open Sport Exerc Med [Internet]. 2024/07/08 ed. 2024;10(3):e002037. Available from: https://www.ncbi.nlm.nih.gov/pubmed/38975026
20. Mohan K, Pearl J. Graphical Models for Processing Missing Data. Journal of the American Statistical Association. 2021;116(534):1023–37.
21. Enders CK. Applied Missing Data Analysis. Guilford Publications; 2022.
22. Enders CK. Applied Missing Data Analysis. Guilford Publications; 2010.